17 research outputs found

    Conversion of a Russian dependency treebank into HPSG derivations

    Get PDF
    Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 7-18. © 2010 The editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/15891

    Language models, surprisal and fantasy in Slavic intercomprehension

    Get PDF
    In monolingual human language processing, the predictability of a word given its surrounding sentential context is crucial. With regard to receptive multilingualism, it is unclear to what extent predictability in context interplays with other linguistic factors in understanding a related but unknown language – a process called intercomprehension. We distinguish two dimensions influencing processing effort during intercomprehension: surprisal in sentential context and linguistic distance. Based on this hypothesis, we formulate expectations regarding the difficulty of designed experimental stimuli and compare them to the results from think-aloud protocols of experiments in which Czech native speakers decode Polish sentences by agreeing on an appropriate translation. On the one hand, orthographic and lexical distances are reliable predictors of linguistic similarity. On the other hand, we estimate the predictability of words in a sentence with trigram language models. We find that linguistic distance (encoding similarity) and in-context surprisal (predictability in context) appear to be complementary, with neither factor outweighing the other, and that distinguishing these two measurable dimensions helps explain certain unexpected effects in human behaviour.
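    The in-context surprisal measure mentioned in this abstract can be sketched in a few lines; the toy corpus, the add-one smoothing, and all function names below are illustrative assumptions, not the study's actual models:

```python
import math
from collections import Counter

def train_trigram(tokens):
    # Count trigrams and their bigram contexts over a tokenised corpus.
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return trigrams, bigrams, vocab

def surprisal(w, context, trigrams, bigrams, vocab):
    # Surprisal in bits: -log2 P(w | w1, w2), with add-one smoothing.
    w1, w2 = context
    num = trigrams[(w1, w2, w)] + 1
    den = bigrams[(w1, w2)] + len(vocab)
    return -math.log2(num / den)

corpus = "the cat sat on the mat the cat sat on the rug".split()
tri, bi, vocab = train_trigram(corpus)
# A frequent continuation is less surprising than a rarely seen one.
print(surprisal("sat", ("the", "cat"), tri, bi, vocab))
print(surprisal("mat", ("the", "cat"), tri, bi, vocab))
```

A word that is highly predictable from its two preceding words receives low surprisal; an unexpected word receives high surprisal, which is the quantity correlated with processing effort above.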

    Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages

    Full text link
    State-of-the-art spoken language identification (LID) systems, which are based on end-to-end deep neural networks, have shown remarkable success not only in discriminating between distant languages but also between closely related languages or even different spoken varieties of the same language. However, it is still unclear to what extent neural LID models generalize to speech samples with different acoustic conditions due to domain shift. In this paper, we present a set of experiments to investigate the impact of domain mismatch on the performance of neural LID systems for a subset of six Slavic languages across two domains (read speech and radio broadcast) and examine two low-level signal descriptors (spectral and cepstral features) for this task. Our experiments show that (1) out-of-domain speech samples severely hinder the performance of neural LID models, and (2) while both spectral and cepstral features show comparable performance within domain, spectral features are more robust under domain mismatch. Moreover, we apply unsupervised domain adaptation to minimize the discrepancy between the two domains in our study. We achieve relative accuracy improvements that range from 9% to 77% depending on the diversity of acoustic conditions in the source domain. (Comment: To appear in INTERSPEECH 202)
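    The two low-level descriptor families compared in this abstract can be sketched with plain NumPy; the frame sizes, the Hann window, and the simplified mel-free pipeline below are illustrative assumptions, not the paper's exact front end:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Slice a waveform into overlapping frames (25 ms / 10 ms at 16 kHz).
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])

def spectral_features(frames):
    # Spectral descriptor: log magnitude spectrum of each windowed frame.
    spec = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))
    return np.log(spec + 1e-8)

def cepstral_features(log_spec, n_coeff=13):
    # Cepstral descriptor: DCT-II of the log spectrum, keeping the first
    # few coefficients (the idea behind MFCC-style features).
    n = log_spec.shape[1]
    k = np.arange(n)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeff), (2 * k + 1) / (2 * n)))
    return log_spec @ basis.T

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)   # 1 s of synthetic noise at 16 kHz
frames = frame_signal(x)
spec = spectral_features(frames)  # per-frame spectral features
ceps = cepstral_features(spec)    # per-frame cepstral features
print(spec.shape, ceps.shape)
```

The cepstral transform compacts each frame into a handful of coefficients, while the spectral representation keeps the full frequency resolution, which is one plausible reason the two behave differently under domain mismatch.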

    On the Correlation of Context-Aware Language Models With the Intelligibility of Polish Target Words to Czech Readers

    Get PDF
    This contribution seeks to provide a rational probabilistic explanation for the intelligibility of words in a genetically related language that is unknown to the reader, a phenomenon referred to as intercomprehension. In this research domain, linguistic distance, among other factors, has been shown to correlate well with the mutual intelligibility of individual words. However, the role of context for the intelligibility of target words in sentences has been examined in very few studies. To address this, we analyze data from web-based experiments in which Czech (CS) respondents were asked to translate highly predictable target words at the final position of Polish sentences. We compare correlations of target word intelligibility with data from 3-gram language models (LMs) to their correlations with data obtained from context-aware LMs. More specifically, we evaluate two context-aware LM architectures: Long Short-Term Memory networks (LSTMs), which can, theoretically, take infinitely long-distance dependencies into account, and Transformer-based LMs, which can access the whole input sequence at the same time. We investigate how their use of context affects surprisal and its correlation with intelligibility.
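    The correlation analysis described in this abstract reduces to comparing two score vectors per target word; the data below are entirely hypothetical and serve only to illustrate the computation, not the study's results:

```python
import numpy as np

# Hypothetical per-target-word data: surprisal under two LM architectures
# and the fraction of respondents who translated each word correctly.
surprisal_ngram = np.array([2.1, 5.4, 3.3, 7.8, 4.0, 6.5])
surprisal_ctx = np.array([1.8, 6.0, 2.9, 8.5, 3.6, 7.1])
intelligibility = np.array([0.9, 0.4, 0.8, 0.2, 0.7, 0.3])

def corr(a, b):
    # Pearson correlation coefficient between two score vectors.
    return float(np.corrcoef(a, b)[0, 1])

# Higher surprisal should go with lower intelligibility, so both
# correlations come out negative; the stronger (more negative) one
# indicates the LM whose predictions track human comprehension better.
print(corr(surprisal_ngram, intelligibility))
print(corr(surprisal_ctx, intelligibility))
```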

    Arguments, Grammatical Relations, and Diathetic Paradigm

    No full text
    This paper argues for the general notion of dependents in HPSG, in addition to arguments and subcategorized elements (valence). It attempts to provide a systematic inventory of ARG-ST / DEPS mappings, which results in a diathetic paradigm. The approach offers an insightful cross-linguistic and cross-constructional perspective. It is important to realize that DEPS is not only an enriched level of argument structure; it is part of the diathesis of the predicator.

    An ontology of systematic relations for a shared grammar of Slavic (Proceedings of COLING’2000, Vol. 1, pages 28-34)

    No full text
    Sharing portions of grammars across languages greatly reduces the costs of multilingual grammar engineering. Related languages share a much wider range of linguistic information than typically assumed in standard multilingual grammar architectures. Taking grammatical relatedness seriously, we are particularly interested in designing linguistically motivated grammatical resources for Slavic languages to be used in applied and theoretical computational linguistics. In order to gain the perspective of a language-family oriented grammar design, we consider an array of systematic relations that can hold between syntactic units. While the categorisation of primitive linguistic entities tends to be language-specific or even construction-specific, the relations holding between them allow various degrees of abstraction. On the basis of Slavic data, we show how a domain ontology conceptualising morphosyntactic "building blocks" can serve as the basis of a shared grammar of Slavic.

    Gaining the Perspective of a Language Family Oriented Grammar Design: Special Predicative Clitics in Slavic

    No full text
    On the abstraction level of shared grammar, a common Slavic inventory of special predicative clitics can be postulated in terms of feature specifications referring to information on TYPE, CASE and INDEX (the latter encompassing person, number and gender). This inventory is subject to parameterisation across Slavic languages. We argue that such an approach can contribute considerably to formalising clitic typology.